{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SdNGfEo2u-r7"
   },
   "source": [
    "# Ungraded Lab: Tokenizing the Sarcasm Dataset\n",
    "\n",
    "In this lab, you will apply what you've learned in the past two exercises to preprocess the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). It contains news headlines that are labeled as either sarcastic or not sarcastic. You will revisit this dataset in later labs, so it is worth getting acquainted with it now."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O1c0PdMndNKY"
   },
   "source": [
    "## Imports\n",
    "\n",
    "Let's start by importing the packages and methods you will use in this lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "m19wi0ehcoqh"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import json\n",
    "from tensorflow.keras.utils import pad_sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Twhyfjg0xTkg"
   },
   "source": [
    "## Download and inspect the dataset\n",
    "\n",
    "Next, you will fetch the dataset and preview some of its elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "33W129a7xgoJ"
   },
   "outputs": [],
   "source": [
    "# Download the dataset\n",
    "!wget -nc https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zJHdzh9FyWa2"
   },
   "source": [
    "The dataset is saved as a [JSON](https://www.json.org/json-en.html) file, and you can use Python's [`json`](https://docs.python.org/3/library/json.html) module to load it into your workspace. The cell below reads the file into a Python list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OkaBMeNDwMel"
   },
   "outputs": [],
   "source": [
    "# Load the JSON file\n",
    "with open(\"./sarcasm.json\", 'r') as f:\n",
    "    datastore = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D2aSBvJVzRNV"
   },
   "source": [
    "You can inspect a few of the elements in the list. Each element is a dictionary with a link to the article, the headline itself, and a label named `is_sarcastic`. The cell below prints two elements with contrasting labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RiiFcWU2xnMJ"
   },
   "outputs": [],
   "source": [
    "# Non-sarcastic headline\n",
    "print(datastore[0])\n",
    "\n",
    "# Sarcastic headline\n",
    "print(datastore[20000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dPuH0bBiz8LJ"
   },
   "source": [
    "Next, collect the headlines, because those are the string inputs that you will preprocess into numeric features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9pxLUQJCxkNB"
   },
   "outputs": [],
   "source": [
    "# Collect the headline of each element into a list\n",
    "sentences = [item['headline'] for item in datastore]"
   ]
  },
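  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The list comprehension above keeps only the headlines. As a minimal sketch of the same pattern, the cell below parses a tiny hand-written JSON string and collects both the headlines and the labels. The two records and the names `sample_json`, `sample_records`, `sample_headlines`, and `sample_labels` are made up for illustration and are not taken from the dataset. Only the `headline` and `is_sarcastic` keys come from the description above, and the integer label values are an assumption."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Hypothetical records that mimic the structure described above\n",
    "sample_json = json.dumps([\n",
    "    {'headline': 'local man wins award', 'is_sarcastic': 0},\n",
    "    {'headline': 'area cat declares itself mayor', 'is_sarcastic': 1},\n",
    "])\n",
    "\n",
    "# Parse the JSON string into a list of dictionaries\n",
    "sample_records = json.loads(sample_json)\n",
    "\n",
    "# Collect the headlines and the labels into two parallel lists\n",
    "sample_headlines = [item['headline'] for item in sample_records]\n",
    "sample_labels = [item['is_sarcastic'] for item in sample_records]\n",
    "\n",
    "print(sample_headlines)\n",
    "print(sample_labels)"
   ]
  },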
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lBHSXJ5V0qqK"
   },
   "source": [
    "## Preprocessing the headlines\n",
    "\n",
    "You can convert the `sentences` list above into padded sequences using the same methods as in the previous labs. The cell below builds the vocabulary, then uses it to generate a post-padded sequence for each of the 26,709 headlines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "enHwI8WwLyyl"
   },
   "outputs": [],
   "source": [
    "# Instantiate the layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization()\n",
    "\n",
    "# Build the vocabulary\n",
    "vectorize_layer.adapt(sentences)\n",
    "\n",
    "# Apply the layer; the output is post-padded by default\n",
    "post_padded_sequences = vectorize_layer(sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2VbRue1ySFzo"
   },
   "source": [
    "You can view the results for a particular headline by changing the value of `index` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9p_iwAbrJV_Z"
   },
   "outputs": [],
   "source": [
    "# Print a sample headline and sequence\n",
    "index = 2\n",
    "print(f'sample headline: {sentences[index]}')\n",
    "print(f'padded sequence: {post_padded_sequences[index]}')\n",
    "print()\n",
    "\n",
    "# Print dimensions of padded sequences\n",
    "print(f'shape of padded sequences: {post_padded_sequences.shape}')"
   ]
  },
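  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the vocabulary-building and post-padding steps above more concrete, the cell below is a rough pure-Python sketch of what they involve. It is not how `TextVectorization` is implemented. It only assumes lowercasing, whitespace splitting, index `0` for padding, index `1` for unknown tokens, and smaller indices for more frequent tokens. The helper names (`build_vocab`, `encode_post_padded`) and the toy sentences are hypothetical, and ties in frequency are broken alphabetically here purely so that the output is deterministic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def build_vocab(texts):\n",
    "    # Count lowercase, whitespace-separated tokens\n",
    "    counts = Counter(token for text in texts for token in text.lower().split())\n",
    "    # Sort by descending frequency, then alphabetically (the tie-break is an assumption)\n",
    "    ordered = sorted(counts, key=lambda token: (-counts[token], token))\n",
    "    # Reserve 0 for padding and 1 for unknown tokens\n",
    "    return {token: i + 2 for i, token in enumerate(ordered)}\n",
    "\n",
    "def encode_post_padded(texts, vocab):\n",
    "    # Map each token to its index, falling back to 1 for unknown tokens\n",
    "    sequences = [[vocab.get(token, 1) for token in text.lower().split()] for text in texts]\n",
    "    # Append zeroes so every sequence matches the longest one\n",
    "    max_len = max(len(seq) for seq in sequences)\n",
    "    return [seq + [0] * (max_len - len(seq)) for seq in sequences]\n",
    "\n",
    "toy_sentences = ['I love my dog', 'I love my cat too']\n",
    "toy_vocab = build_vocab(toy_sentences)\n",
    "toy_padded = encode_post_padded(toy_sentences, toy_vocab)\n",
    "\n",
    "print(toy_vocab)\n",
    "print(toy_padded)"
   ]
  },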
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cguoGA9veJLN"
   },
   "source": [
    "For pre-padding, you have to set up the `TextVectorization` layer differently. Instead of the automatic post-padding shown above, you want sequences of variable length, which you can then pass to the `pad_sequences()` utility function you used in the previous lab. The cells below show one way to do it:\n",
    "\n",
    "* First, you will initialize the `TextVectorization` layer and set its `ragged` flag to `True`. The output will then be a [ragged tensor](https://www.tensorflow.org/guide/ragged_tensor), which simply means a tensor with variable-length elements. Without the trailing zeroes, the sequences have different lengths, so you need a ragged tensor to hold them.\n",
    "\n",
    "* Like before, you will use the layer's `adapt()` method to generate a vocabulary.\n",
    "\n",
    "* Then, you will apply the layer to the string sentences to generate the integer sequences. As mentioned, these will not be post-padded.\n",
    "\n",
    "* Lastly, you will pass this ragged tensor to the [`pad_sequences()`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) function, which pre-pads by default, to generate pre-padded sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QxHarVDjws5b"
   },
   "outputs": [],
   "source": [
    "# Instantiate the layer and set the `ragged` flag to `True`\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)\n",
    "\n",
    "# Build the vocabulary\n",
    "vectorize_layer.adapt(sentences)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tNlmYjgUx-qk"
   },
   "outputs": [],
   "source": [
    "# Apply the layer to generate a ragged tensor\n",
    "ragged_sequences = vectorize_layer(sentences)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vZnXkNILyAVb"
   },
   "outputs": [],
   "source": [
    "# Print a sample headline and sequence\n",
    "index = 2\n",
    "print(f'sample headline: {sentences[index]}')\n",
    "print(f'ragged sequence: {ragged_sequences[index]}')\n",
    "print()\n",
    "\n",
    "# Print dimensions of the ragged tensor\n",
    "print(f'shape of ragged sequences: {ragged_sequences.shape}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TNkugcfkPv1w"
   },
   "outputs": [],
   "source": [
    "# Apply pre-padding to the ragged tensor\n",
    "pre_padded_sequences = pad_sequences(ragged_sequences.numpy())\n",
    "\n",
    "# Preview the result for the sequence at index 2\n",
    "pre_padded_sequences[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "izkWrVNKQwJQ"
   },
   "source": [
    "You can compare the post-padded and pre-padded sequences for a headline by changing the value of `index` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MtgG2CMtPbN2"
   },
   "outputs": [],
   "source": [
    "# Print a sample headline and sequence\n",
    "index = 2\n",
    "print(f'sample headline: {sentences[index]}')\n",
    "print()\n",
    "print(f'post-padded sequence: {post_padded_sequences[index]}')\n",
    "print()\n",
    "print(f'pre-padded sequence: {pre_padded_sequences[index]}')\n",
    "print()\n",
    "\n",
    "# Print dimensions of padded sequences\n",
    "print(f'shape of post-padded sequences: {post_padded_sequences.shape}')\n",
    "print(f'shape of pre-padded sequences: {pre_padded_sequences.shape}')"
   ]
  },
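  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only difference between the post-padded and pre-padded sequences above is where the zeroes go. The cell below sketches that idea in plain Python on variable-length lists, which play the role of the ragged tensor. The `pad_to_longest` helper and the toy sequences are hypothetical and are not part of Keras, and the sketch leaves out truncation and the other options that `pad_sequences()` supports."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pad_to_longest(sequences, padding='pre', value=0):\n",
    "    # Only the two padding styles discussed in this lab are handled\n",
    "    if padding not in ('pre', 'post'):\n",
    "        raise ValueError('padding must be pre or post')\n",
    "    # Every sequence is padded to the length of the longest one\n",
    "    max_len = max(len(seq) for seq in sequences)\n",
    "    padded = []\n",
    "    for seq in sequences:\n",
    "        fill = [value] * (max_len - len(seq))\n",
    "        # Put the fill values before the tokens for 'pre', after them for 'post'\n",
    "        padded.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)\n",
    "    return padded\n",
    "\n",
    "# Hypothetical variable-length integer sequences\n",
    "toy_ragged = [[5, 8, 2], [7], [4, 9]]\n",
    "toy_pre = pad_to_longest(toy_ragged, padding='pre')\n",
    "toy_post = pad_to_longest(toy_ragged, padding='post')\n",
    "\n",
    "print(f'pre-padded:  {toy_pre}')\n",
    "print(f'post-padded: {toy_post}')"
   ]
  },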
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4wyLF5T036W8"
   },
   "source": [
    "This concludes the short demo of text preprocessing on a relatively large dataset. Next week, you will start building models that can be trained on these output sequences. See you there!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
